[dev-v2.9] Add ServiceMonitor for scraping fleet-controller in `rancher-monitoring` chart #3949
Validation steps
Ex:- longhorn-controller: repository: rancher/hardened-sriov-cni tag: v2.6.3-build20230913
Test report
Apart that rancher-monitoring
As a workaround I did an upgrade from previous version
Environment
Test
The installation problem of 104 has been fixed by #4026. Now I could install the version directly, and the fleet target is there.
…er-monitoring` chart (rancher#3949)
Relates to rancher/fleet#2460
Issue:
The direct issue is this one: rancher/fleet#2295
The whole story is here: rancher/fleet#1408
The PR that introduced metrics into fleet: rancher/fleet#2172
The changes have been merged into Fleet v0.10.0-rc.13. Fleet 0.10 is planned to be released with Rancher 2.9.
Problem
Further monitoring additions that build on the newly introduced fleet metrics require Prometheus to scrape the fleet-controllers. This calls for an additional ServiceMonitor that points to the Kubernetes services created by the fleet chart, which in turn point to the fleet-controller metrics.
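To make the shape of such a resource concrete, here is a minimal sketch that builds a ServiceMonitor manifest as a Python dict and prints it as JSON (which `kubectl apply -f -` accepts alongside YAML). The `apiVersion`, `kind`, and `spec` fields follow the prometheus-operator API. The namespace, label selector, port name, and resource name are assumptions chosen for illustration and are not taken from this PR or from the fleet chart.

```python
import json

# Assumed values for illustration only; check the Services created by the
# fleet chart for the real namespace, labels, and port name.
FLEET_NAMESPACE = "cattle-fleet-system"
FLEET_SERVICE_LABELS = {"app": "fleet-controller"}
METRICS_PORT_NAME = "metrics"


def build_service_monitor(name: str = "fleet-controller") -> dict:
    """Return a ServiceMonitor manifest that selects the fleet metrics Services."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": FLEET_NAMESPACE},
        "spec": {
            # Only look for Services in the namespace where fleet runs.
            "namespaceSelector": {"matchNames": [FLEET_NAMESPACE]},
            # Select the Services created by the fleet chart by label.
            "selector": {"matchLabels": dict(FLEET_SERVICE_LABELS)},
            # Scrape the named Service port at the conventional /metrics path.
            "endpoints": [{"port": METRICS_PORT_NAME, "path": "/metrics"}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_service_monitor(), indent=2))
```

In the chart itself this would more likely live as a Helm template; the dict form is only meant to show which fields tie the ServiceMonitor to the fleet services.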
Solution
An additional ServiceMonitor needs to be created when the `rancher-monitoring` chart is installed, so that the Prometheus instance it installs is automatically configured to scrape the fleet-controllers.
This enables further monitoring capabilities in Rancher through the `rancher-monitoring` chart, for instance Prometheus alerts or Grafana dashboards. The latter may be embedded into Rancher, similar to how the Grafana dashboards are already embedded into Rancher and displayed through the Rancher UI when the `rancher-monitoring` chart is installed.
Testing
On a cluster with Rancher and a fleet version >= v0.10.0-rc.13, install the `rancher-monitoring` chart that includes the changes of this PR.
Open the Prometheus UI, navigate to `Targets`, and check for fleet-controller.
If metrics are to be tested with sharding enabled in Fleet, a feature also first introduced in v0.10.0-rc.13, make sure to use a fleet version that includes "metrics: make sure metrics work well with sharding" (fleet#2420), which, at the time of writing, is not yet in an RC of fleet. Also, fleet needs to be deployed with sharding enabled as described in the fleet-docs.
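The target check can also be done without the UI by reading the Prometheus HTTP API endpoint `/api/v1/targets`. The sketch below filters such a response for fleet-related targets. The sample payload, the scrape pool names, and the `fleet` search string are illustrative assumptions, and `fetch_targets` assumes Prometheus has been made reachable at the given URL (for example through a port-forward).

```python
import json
import urllib.request


def fetch_targets(base_url: str) -> dict:
    """Fetch the target list from a reachable Prometheus instance."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def fleet_targets(targets_response: dict, needle: str = "fleet") -> list:
    """Return (scrapePool, health) pairs for active targets that mention `needle`."""
    found = []
    for target in targets_response.get("data", {}).get("activeTargets", []):
        pool = target.get("scrapePool", "")
        job = target.get("labels", {}).get("job", "")
        if needle in pool or needle in job:
            found.append((pool, target.get("health", "unknown")))
    return found


# Hypothetical response, shaped like the real API output but with made-up names.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapePool": "serviceMonitor/cattle-fleet-system/fleet-controller/0",
                "labels": {"job": "fleet-controller"},
                "health": "up",
            },
            {
                "scrapePool": "serviceMonitor/cattle-monitoring-system/node-exporter/0",
                "labels": {"job": "node-exporter"},
                "health": "up",
            },
        ]
    },
}

if __name__ == "__main__":
    for pool, health in fleet_targets(SAMPLE_RESPONSE):
        print(pool, health)
```

Against a live cluster, `fleet_targets(fetch_targets("http://localhost:9090"))` would replace the sample payload; an empty result suggests the ServiceMonitor was not picked up.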
Engineering Testing
Manual Testing
Performed as described in `Testing`, including testing with sharding enabled in fleet.
Automated Testing
The initial PR adds E2E tests that check the metrics exposed by the fleet-controller through the services generated by the Helm chart (when fleet is installed). Those tests do not cover the usage of a ServiceMonitor as introduced in this PR. Further PRs have followed to extend and improve testing of metrics in fleet:
QA Testing Considerations
Regressions Considerations
The probability of this change introducing regressions is low, as it simply extends already implemented functionality with a rather simple resource, which is part of the `rancher-monitoring-crd` chart.
For some more context, the `ServiceMonitor` is a custom Kubernetes resource and part of the `prometheus-operator` controller. The controller looks at the resource and configures Prometheus to scrape an additional target, which in this case is fleet. If anything inside this resource is wrong, it is not expected to affect any other resources of the same kind. It would be surprising if scraping this amount of additional metrics had a significant performance impact, but over the long term it could increase the storage space required for metrics. That said, Prometheus is by default configured to retain data for only 15 days (and the `rancher-monitoring` chart sets a default retention size of 50G), so this aspect should also be negligible. The scraped metrics could potentially conflict with other metrics, which is why they are prefixed with `fleet_`, making conflicts virtually impossible.
Backporting considerations
This change does not need to be backported to other versions. This is a new feature in fleet and no plans exist to backport it.